Please use a color other than those listed to highlight your answers to my comments. I would also ask you not to change, move, or delete my comments, so that it is easier for me to navigate during the next review.
In addition, my comments are defined as headings. They can clutter the document structure; however, they are convenient, since you can jump straight to them. I will remove the headings from my comments in the next review.
**A few words about the project:** you did a great job. Everything is clear and neat, and there are analyses and intermediate conclusions. The project has a lot of graphs, which is great. I still have some questions, which I've written in my comments, along with some recommendations for improving the project. If you have any questions, feel free to ask me.
**About the presentation:**
Use `fig.savefig('fig.png', dpi=150)`; the `dpi` parameter sets the output resolution in dots per inch.
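As a minimal sketch of the suggestion above (assuming a headless environment, hence the `Agg` backend), saving a figure at higher resolution looks like:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; an assumption for headless use
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2], [10, 20, 15])
# dpi sets the output resolution in dots per inch
fig.savefig('fig.png', dpi=150)
```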
Project Description
You’ve decided to open a small robot-run cafe in Los Angeles. The project is promising but expensive, so you and your partners decide to try to attract investors. They’re interested in the current market conditions — will you be able to maintain your success when the novelty of robot waiters wears off?
Tasks
Instructions for completing the project
Step 1. Download the data and prepare it for analysis
Step 2. Data analysis: Perform the following tasks:
Draw Conclusions
Data Description
Plan of Action
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
from scipy import stats as st
from plotly import graph_objects as go
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
sns.set_palette('bright')
Read data as a dataframe and examine original data
# create dataframe to examine original data
cafe_data_orig = pd.read_csv('datasets/rest_data_us.csv')
Examine original df
cafe_data_orig.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9651 entries, 0 to 9650
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   id           9651 non-null   int64
 1   object_name  9651 non-null   object
 2   address      9651 non-null   object
 3   chain        9648 non-null   object
 4   object_type  9651 non-null   object
 5   number       9651 non-null   int64
dtypes: int64(2), object(4)
memory usage: 452.5+ KB
cafe_data_orig.head()
| | id | object_name | address | chain | object_type | number |
|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 |
| 2 | 11788 | STREET CHURROS | 6801 HOLLYWOOD BLVD # 253 | False | Fast Food | 20 |
| 3 | 11789 | TRINITI ECHO PARK | 1814 W SUNSET BLVD | False | Restaurant | 22 |
| 4 | 11790 | POLLEN | 2100 ECHO PARK AVE | False | Restaurant | 20 |
cafe_data_orig['id'].nunique()
9651
cafe_data_orig['object_name'].nunique()
8672
cafe_data_orig['address'].nunique()
8517
cafe_data_orig['chain'].value_counts(dropna = False)
False    5972
True     3676
NaN         3
Name: chain, dtype: int64
null_query = cafe_data_orig.loc[cafe_data_orig['chain'].isnull()]
null_query
| | id | object_name | address | chain | object_type | number |
|---|---|---|---|---|---|---|
| 7408 | 19194 | TAQUERIA LOS 3 CARNALES | 5000 E WHITTIER BLVD | NaN | Restaurant | 14 |
| 7523 | 19309 | JAMMIN JIMMY'S PIZZA | 1641 FIRESTONE BLVD | NaN | Pizza | 1 |
| 8648 | 20434 | THE LEXINGTON THEATER | 129 E 3RD ST | NaN | Restaurant | 35 |
search_1 = cafe_data_orig.query('object_name == "TAQUERIA LOS 3 CARNALES"')
search_2 = cafe_data_orig.query('object_name == "JAMMIN JIMMY\'S PIZZA"')
search_3 = cafe_data_orig.query('object_name == "THE LEXINGTON THEATER"')
display(search_1)
display(search_2)
display(search_3)
| | id | object_name | address | chain | object_type | number |
|---|---|---|---|---|---|---|
| 7408 | 19194 | TAQUERIA LOS 3 CARNALES | 5000 E WHITTIER BLVD | NaN | Restaurant | 14 |
| | id | object_name | address | chain | object_type | number |
|---|---|---|---|---|---|---|
| 7523 | 19309 | JAMMIN JIMMY'S PIZZA | 1641 FIRESTONE BLVD | NaN | Pizza | 1 |
| | id | object_name | address | chain | object_type | number |
|---|---|---|---|---|---|---|
| 8648 | 20434 | THE LEXINGTON THEATER | 129 E 3RD ST | NaN | Restaurant | 35 |
cafe_data_orig['object_type'].value_counts(dropna = False)
Restaurant    7255
Fast Food     1066
Cafe           435
Pizza          320
Bar            292
Bakery         283
Name: object_type, dtype: int64
Actions performed
Dataframe info
Clean up the data and process where necessary
Plan of Action
Due to the low number of null values in `chain` (3 of 9,651 rows), they can be filled without affecting the analysis
Save working version of the data
# create working versions of the dataframes for data analysis
cafe_data = pd.read_csv('datasets/rest_data_us.csv')
Fill null values
cafe_data.loc[cafe_data['chain'].isnull(), 'chain'] = False
cafe_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9651 entries, 0 to 9650
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   id           9651 non-null   int64
 1   object_name  9651 non-null   object
 2   address      9651 non-null   object
 3   chain        9651 non-null   object
 4   object_type  9651 non-null   object
 5   number       9651 non-null   int64
dtypes: int64(2), object(4)
memory usage: 452.5+ KB
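The same fill can also be written with `fillna`. A toy sketch (the small frame below is illustrative, not the project data):

```python
import pandas as pd

# toy frame standing in for cafe_data; column name matches the project
df = pd.DataFrame({'chain': [True, None, False, None]})
# equivalent to: df.loc[df['chain'].isnull(), 'chain'] = False
df['chain'] = df['chain'].fillna(False)
```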
Optimize datatype storage
# optimize datatype storage
cafe_data['id'] = pd.to_numeric(cafe_data['id'], downcast = 'integer')
cafe_data['object_name'] = cafe_data['object_name'].astype('category')
cafe_data['address'] = cafe_data['address'].astype('category')
cafe_data['chain'] = cafe_data['chain'].astype('bool')
cafe_data['object_type'] = cafe_data['object_type'].astype('category')
cafe_data['number'] = pd.to_numeric(cafe_data['number'], downcast = 'integer')
cafe_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9651 entries, 0 to 9650
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   id           9651 non-null   int16
 1   object_name  9651 non-null   category
 2   address      9651 non-null   category
 3   chain        9651 non-null   bool
 4   object_type  9651 non-null   category
 5   number       9651 non-null   int16
dtypes: bool(1), category(3), int16(2)
memory usage: 745.0 KB
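On a toy series, `pd.to_numeric` with `downcast` picks the smallest integer subtype that can hold the data (a sketch; the resulting dtype depends on the value range):

```python
import pandas as pd

s = pd.Series([11786, 11787, 21377])          # default dtype is int64
# values fit in int16 (max 32767), so that subtype is chosen
small = pd.to_numeric(s, downcast='integer')
```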
Retitle columns to be more readable/user friendly
# review original data column titles
cafe_data.columns
Index(['id', 'object_name', 'address', 'chain', 'object_type', 'number'], dtype='object')
# retitle columns to be readable/user friendly
cafe_data = cafe_data.rename(columns = {'id': 'location_id', 'object_name': 'location_name', 'object_type': 'location_type', 'number': 'num_seats'})
# review updated column titles
cafe_data.columns
Index(['location_id', 'location_name', 'address', 'chain', 'location_type',
'num_seats'],
dtype='object')
Identify missing values in the dataset
print('number of missing values in df')
cafe_data.isnull().sum()
number of missing values in df
location_id      0
location_name    0
address          0
chain            0
location_type    0
num_seats        0
dtype: int64
Identify duplicate values in the dataset
duplicate_rows = cafe_data.duplicated().sum()
print('number of duplicate rows: {}'.format(duplicate_rows))
number of duplicate rows: 0
# determine if there are duplicates with the same address and name
duplicates_name_address = cafe_data[['location_name', 'address']].duplicated().sum()
print('number of duplicated addresses with the same restaurant name: {}'.format(duplicates_name_address))
number of duplicated addresses with the same restaurant name: 0
# determine if there are duplicate addresses
duplicate_address = cafe_data[cafe_data['address'].duplicated()]
duplicate_address.sort_values(by = 'address').head(20)
| | location_id | location_name | address | chain | location_type | num_seats |
|---|---|---|---|---|---|---|
| 5624 | 17410 | SOUPLANTATION # 17 | 100 N LA CIENEGA BLVD # E-3 | True | Restaurant | 199 |
| 8201 | 19987 | HOME TURF SPORTS BAR | 100 WORLD WAY | False | Bar | 32 |
| 6277 | 18063 | GORDEN BIERSCH | 100 WORLD WAY | False | Restaurant | 17 |
| 6515 | 18301 | STARBUCKS COFFEE | 100 WORLD WAY | True | Cafe | 9 |
| 6000 | 17786 | PINKBERRY #78 | 100 WORLD WAY | True | Restaurant | 3 |
| 5996 | 17782 | TORN BASIL | 100 WORLD WAY | False | Restaurant | 5 |
| 8946 | 20732 | ROCCO'S TAVERN | 1000 GAYLEY AVE | True | Restaurant | 134 |
| 6738 | 18524 | LA DODGERS CLUBHOUSE | 1000 VIN SCULLY AVE | False | Restaurant | 46 |
| 3677 | 15463 | LOBBY CAFE BAR | 1000 WILSHIRE BLVD | False | Bar | 48 |
| 7029 | 18815 | CITY CAFE REMOTE STORAGE | 1000 WILSHIRE BLVD | False | Restaurant | 16 |
| 7026 | 18812 | CITY CAFE | 1000 WILSHIRE BLVD | False | Restaurant | 47 |
| 8883 | 20669 | TANDOORI EATS | 10004 NATIONAL BLVD | False | Restaurant | 7 |
| 7036 | 18822 | ORGANIC CORNER CAFE | 1001 WILSHIRE BLVD | False | Restaurant | 6 |
| 9444 | 21230 | WIENERSCHNITZEL #228 | 1002 S ATLANTIC BLVD | False | Restaurant | 4 |
| 9591 | 21377 | TACOS Y GUISADOS EL CHARRO | 10024 S FIGUEROA ST | True | Fast Food | 14 |
| 9417 | 21203 | CARLS JR #97 | 1005 S FAIRFAX AVE | False | Restaurant | 94 |
| 7634 | 19420 | PORK JANGGUN | 1008 S ST ANDREWS PL | False | Restaurant | 32 |
| 7191 | 18977 | GARLO'S PIES | 1010 GLENDON AVE | False | Restaurant | 28 |
| 7551 | 19337 | LOS ANGELES COUNTRY CLUB | 10101 WILSHIRE BLVD | False | Restaurant | 4 |
| 6206 | 17992 | DRAGON LOCO | 1011 W FLORENCE AVE | False | Restaurant | 25 |
# determine the number of duplicate addresses
address_duplicates = cafe_data['address'].duplicated().sum()
print('number of duplicate addresses: {}'.format(address_duplicates))
number of duplicate addresses: 1134
# determine the percent of duplicate addresses
total_entries = len(cafe_data)
percent_duplicate_addresses = (address_duplicates / total_entries)
print('percent of duplicate addresses: {:.2%}'.format(percent_duplicate_addresses))
percent of duplicate addresses: 11.75%
Actions Performed
4.1 Investigate the proportions of the various types of establishments. Plot a graph.
# determine the number of establishments per type
type_count = cafe_data.groupby('location_type').agg({'location_id': 'count'}).reset_index()
type_count.columns = ['location_type', 'num_per_type']
type_count.sort_values(by = 'num_per_type')
| | location_type | num_per_type |
|---|---|---|
| 0 | Bakery | 283 |
| 1 | Bar | 292 |
| 4 | Pizza | 320 |
| 2 | Cafe | 435 |
| 3 | Fast Food | 1066 |
| 5 | Restaurant | 7255 |
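The proportions themselves can be read directly with `value_counts(normalize=True)`; a toy sketch with illustrative counts, not the project data:

```python
import pandas as pd

# three restaurants and one cafe, purely for illustration
types = pd.Series(['Restaurant'] * 3 + ['Cafe'])
# normalize=True returns shares instead of raw counts
shares = types.value_counts(normalize=True)
```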
# plot a graph
plt.figure(figsize = (12, 6))
# original line of code
# type_graph = sns.barplot(data = type_count, x = 'location_type', y = 'num_per_type')
# sorted graph
type_graph = sns.barplot(data = type_count, x = 'location_type', y = 'num_per_type', order = type_count.sort_values('num_per_type').location_type)
plt.title('number of locations per restaurant type', size = 15)
plt.xlabel('location type', size = 15)
plt.ylabel('number of locations', size = 15)
for bar in type_graph.patches:
type_graph.annotate(format(bar.get_height(), '.0f'),
xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
ha = 'center',
va = 'center',
xytext = (0, 7),
textcoords = 'offset points')
plt.show()
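On matplotlib 3.4+, the manual annotation loop over `patches` can be replaced with `Axes.bar_label`; a minimal sketch with illustrative data, not tied to the project variables:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; an assumption for this sketch
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
container = ax.bar(['Bakery', 'Bar', 'Cafe'], [283, 292, 435])
# one integer label above each bar, replacing the annotate loop
labels = ax.bar_label(container, fmt='%d', padding=7)
```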
# create a variable to hold the type labels
rest_labels = type_count['location_type']
# create a variable to hold the values
rest_type_count = type_count['num_per_type']
# plot the data
type_pie = go.Figure(data = [go.Pie(labels = rest_labels, values = rest_type_count)])
type_pie.update_layout(title = 'Proportion of Establishment Types')
type_pie.show()
4.1 CONCLUSION: Investigate the proportions of the various types of establishments. Plot a graph.
4.2 Investigate the proportions of chain and nonchain establishments. Plot a graph.
# calculate counts of chains and non-chains
chains = cafe_data.groupby('chain').agg({'location_id': 'count'}).reset_index()
chains.columns = ['chain', 'count']
chains
| | chain | count |
|---|---|---|
| 0 | False | 5975 |
| 1 | True | 3676 |
# create a variable to hold the labels for the graph
chain_labels = ['Non-Chain', 'Chain']
# create a variable to hold the values for the graph
chain_type_count = chains['count']
# graph the data
chain_pie = go.Figure(data = [go.Pie(labels = chain_labels, values = chain_type_count)])
chain_pie.update_layout(title = 'Chain vs Non-Chain Establishments')
chain_pie.show()
4.2 CONCLUSION: Investigate the proportions of chain and nonchain establishments. Plot a graph.
4.3 Which type of establishment is typically a chain?
# review the data
cafe_data.head(2)
| | location_id | location_name | address | chain | location_type | num_seats |
|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 |
# create a table with the number of chains/non-chains per establishment type
type_breakdown = cafe_data.groupby(['location_type', 'chain']).agg({'location_id': 'count'}).reset_index()
type_breakdown.columns = ['location_type', 'chain', 'type_count']
type_breakdown
| | location_type | chain | type_count |
|---|---|---|---|
| 0 | Bakery | False | 0 |
| 1 | Bakery | True | 283 |
| 2 | Bar | False | 215 |
| 3 | Bar | True | 77 |
| 4 | Cafe | False | 169 |
| 5 | Cafe | True | 266 |
| 6 | Fast Food | False | 461 |
| 7 | Fast Food | True | 605 |
| 8 | Pizza | False | 167 |
| 9 | Pizza | True | 153 |
| 10 | Restaurant | False | 4963 |
| 11 | Restaurant | True | 2292 |
# combine the tables
type_breakdown = pd.merge(type_breakdown, type_count, on = 'location_type')
# calculate the chain/non-chain percentages
type_breakdown['percent'] = round((type_breakdown['type_count'] / type_breakdown['num_per_type']) * 100)
type_breakdown
| | location_type | chain | type_count | num_per_type | percent |
|---|---|---|---|---|---|
| 0 | Bakery | False | 0 | 283 | 0.0 |
| 1 | Bakery | True | 283 | 283 | 100.0 |
| 2 | Bar | False | 215 | 292 | 74.0 |
| 3 | Bar | True | 77 | 292 | 26.0 |
| 4 | Cafe | False | 169 | 435 | 39.0 |
| 5 | Cafe | True | 266 | 435 | 61.0 |
| 6 | Fast Food | False | 461 | 1066 | 43.0 |
| 7 | Fast Food | True | 605 | 1066 | 57.0 |
| 8 | Pizza | False | 167 | 320 | 52.0 |
| 9 | Pizza | True | 153 | 320 | 48.0 |
| 10 | Restaurant | False | 4963 | 7255 | 68.0 |
| 11 | Restaurant | True | 2292 | 7255 | 32.0 |
# plot the data
plt.figure(figsize = (12, 6))
type_breakdown_graph = sns.barplot(data = type_breakdown,
x = 'location_type',
y ='percent',
hue = 'chain',
edgecolor = 'w') #swap y with 'type_count' for counts instead of percent
# plt.title('Establishment Type Breakdown: Chain vs Non-Chain', size = 15)
plt.title('Percentage of Chain and Non-Chain Establishments Per Type', size = 15)
plt.xlabel('establishment type', size = 15)
# plt.ylabel('number of establishments', size = 15)
plt.ylabel('percent of chain and non-chain per type', size = 15)
for bar in type_breakdown_graph.patches:
type_breakdown_graph.annotate(format(bar.get_height(), '.0f'),
xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
ha = 'center',
va = 'center',
xytext = (0, 7),
textcoords = 'offset points')
plt.show()
# Reviewer's code
cafe_data.replace({'True': 1, 'False':0}).groupby('location_type')['chain'].agg(['sum','count', lambda x: '{:.2%}'.format(x.mean())]).rename(columns={'sum':'Chain', 'count': 'All Establishments'})
| location_type | Chain | All Establishments | <lambda_0> |
|---|---|---|---|
| Bakery | 283 | 283 | 100.00% |
| Bar | 77 | 292 | 26.37% |
| Cafe | 266 | 435 | 61.15% |
| Fast Food | 605 | 1066 | 56.75% |
| Pizza | 153 | 320 | 47.81% |
| Restaurant | 2292 | 7255 | 31.59% |
4.3 CONCLUSION: Which type of establishment is typically a chain?
Good :)
4.4 What characterizes chains: many establishments with a small number of seats or a few establishments with a lot of seats?
# review data
cafe_data.head(2)
| | location_id | location_name | address | chain | location_type | num_seats |
|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 |
# make a table with establishments that are only chains
chains_only = cafe_data.query('chain == True')
chains_only.head(2)
| | location_id | location_name | address | chain | location_type | num_seats |
|---|---|---|---|---|---|---|
| 8 | 11794 | ABC DONUTS | 3027 N SAN FERNANDO RD UNIT 103 | True | Fast Food | 1 |
| 10 | 11796 | EL POLLO LOCO | 5319 W SUNSET BLVD | True | Restaurant | 38 |
# examine the number of seats in chains
chains_only['num_seats'].describe()
count    3676.000000
mean       39.694233
std        43.437212
min         1.000000
25%        13.000000
50%        25.000000
75%        44.000000
max       229.000000
Name: num_seats, dtype: float64
# calculate the average number of seats in chains
chains_avg = chains_only['num_seats'].mean()
print('average number of seats in chain establishments: {:.0f}'.format(chains_avg))
average number of seats in chain establishments: 40
# distribution plot
plt.figure(figsize = (12, 6))
sns.histplot(chains_only['num_seats'], bins = 50, kde = True, stat = 'density')  # distplot is deprecated in recent seaborn
plt.title('Distribution of Number of Seats in Chains')
plt.xlabel('number of seats', size = 15)
plt.ylabel('density of establishments', size = 15)
plt.show()
# calculate the number of locations that have specific number of seats
chain_seat_counts = chains_only.groupby('num_seats').agg({'location_id': 'count'}).reset_index()
chain_seat_counts.columns = ['num_seats', 'num_locations']
chain_seat_counts.head()
| | num_seats | num_locations |
|---|---|---|
| 0 | 1 | 77 |
| 1 | 2 | 72 |
| 2 | 3 | 84 |
| 3 | 4 | 73 |
| 4 | 5 | 69 |
# plot a scatter plot for easier visualization
plt.figure(figsize = (12, 6))
sns.scatterplot(data = chain_seat_counts, x = 'num_seats', y = 'num_locations', label = 'chain')
plt.title('Number of Seats in Chain Establishments', size = 15)
plt.xlabel('number of seats', size = 15)
plt.ylabel('number of establishments', size = 15)
plt.show()
# evaluate establishments with 50 seats or fewer vs. more than 50
under_fifty = chains_only.query('num_seats <= 50')
print('number of chain establishments with 50 seats or fewer: {}'.format(len(under_fifty)))
over_fifty = chains_only.query('num_seats > 50')
print('number of chain establishments with more than 50 seats: {}'.format(len(over_fifty)))
under_fifty_perc = len(under_fifty) / len(chains_only)
print('{:.1%} of chain establishments have 50 seats or fewer'.format(under_fifty_perc))
number of chain establishments with 50 seats or fewer: 3033
number of chain establishments with more than 50 seats: 643
82.5% of chain establishments have 50 seats or fewer
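The share can also be computed in one step, since the mean of a boolean mask equals the fraction of True values; a sketch with toy seat counts:

```python
import pandas as pd

# toy seat counts, purely for illustration
seats = pd.Series([1, 13, 25, 44, 120])
# the boolean mask averages to the share of establishments with <= 50 seats
share_small = (seats <= 50).mean()
```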
# plot the distribution of 50 seats or less
plt.figure(figsize = (12, 6))
sns.histplot(under_fifty['num_seats'], kde = True, stat = 'density')  # distplot is deprecated in recent seaborn
plt.title('Distribution of Chains With 50 Seats or Fewer', size = 15)
plt.xlabel('number of seats', size = 15)
plt.ylabel('density of establishments', size = 15)
plt.show()
4.4 CONCLUSION: What characterizes chains: many establishments with a small number of seats or a few establishments with a lot of seats?
4.5 Determine the average number of seats for each type of restaurant. On average, which type of restaurant has the greatest number of seats? Plot graphs.
# review data
cafe_data.head(2)
| | location_id | location_name | address | chain | location_type | num_seats |
|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 |
# calculate the average number of seats per establishment type
avg_seats = cafe_data.pivot_table(index = 'location_type', values = 'num_seats', aggfunc = 'mean').reset_index()
avg_seats.columns = ['location_type', 'avg_num_seats']
avg_seats['avg_num_seats'] = avg_seats['avg_num_seats'].round()
avg_seats.sort_values(by = 'avg_num_seats')
| | location_type | avg_num_seats |
|---|---|---|
| 0 | Bakery | 22.0 |
| 2 | Cafe | 25.0 |
| 4 | Pizza | 28.0 |
| 3 | Fast Food | 32.0 |
| 1 | Bar | 45.0 |
| 5 | Restaurant | 48.0 |
# plot the data
plt.figure(figsize = (12, 6))
avg_seats_graph = sns.barplot(data = avg_seats, x = 'location_type', y = 'avg_num_seats', order = avg_seats.sort_values('avg_num_seats').location_type)
plt.title('Average Number of Seats per Establishment Type', size = 15)
plt.xlabel('establishment type', size = 15)
plt.ylabel('average number of seats')
avg_seats_graph
for bar in avg_seats_graph.patches:
avg_seats_graph.annotate(format(bar.get_height(), '.0f'),
xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
ha = 'center',
va = 'center',
xytext = (0, 7),
textcoords = 'offset points')
plt.show()
4.5 CONCLUSION: Determine the average number of seats for each type of restaurant. On average, which type of restaurant has the greatest number of seats? Plot graphs.
4.6 Put the data on street names from the address column in a separate column.
# review data
cafe_data.head(2)
| | location_id | location_name | address | chain | location_type | num_seats |
|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 |
# separate street number from street
cafe_data[['add_num', 'street']] = cafe_data['address'].str.split(n=1, expand = True)
cafe_data.head()
| | location_id | location_name | address | chain | location_type | num_seats | add_num | street |
|---|---|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 | 3708 | N EAGLE ROCK BLVD |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 | 100 | WORLD WAY # 120 |
| 2 | 11788 | STREET CHURROS | 6801 HOLLYWOOD BLVD # 253 | False | Fast Food | 20 | 6801 | HOLLYWOOD BLVD # 253 |
| 3 | 11789 | TRINITI ECHO PARK | 1814 W SUNSET BLVD | False | Restaurant | 22 | 1814 | W SUNSET BLVD |
| 4 | 11790 | POLLEN | 2100 ECHO PARK AVE | False | Restaurant | 20 | 2100 | ECHO PARK AVE |
# remove any numbers at the end of the street
cafe_data['street'] = cafe_data['street'].str.rstrip('# 0123456789')
cafe_data.head()
| | location_id | location_name | address | chain | location_type | num_seats | add_num | street |
|---|---|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 | 3708 | N EAGLE ROCK BLVD |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 | 100 | WORLD WAY |
| 2 | 11788 | STREET CHURROS | 6801 HOLLYWOOD BLVD # 253 | False | Fast Food | 20 | 6801 | HOLLYWOOD BLVD |
| 3 | 11789 | TRINITI ECHO PARK | 1814 W SUNSET BLVD | False | Restaurant | 22 | 1814 | W SUNSET BLVD |
| 4 | 11790 | POLLEN | 2100 ECHO PARK AVE | False | Restaurant | 20 | 2100 | ECHO PARK AVE |
# remove 1/2 labels from street
cafe_data['street'] = cafe_data['street'].str.lstrip('"1/2"')
cafe_data.head()
| | location_id | location_name | address | chain | location_type | num_seats | add_num | street |
|---|---|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 | 3708 | N EAGLE ROCK BLVD |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 | 100 | WORLD WAY |
| 2 | 11788 | STREET CHURROS | 6801 HOLLYWOOD BLVD # 253 | False | Fast Food | 20 | 6801 | HOLLYWOOD BLVD |
| 3 | 11789 | TRINITI ECHO PARK | 1814 W SUNSET BLVD | False | Restaurant | 22 | 1814 | W SUNSET BLVD |
| 4 | 11790 | POLLEN | 2100 ECHO PARK AVE | False | Restaurant | 20 | 2100 | ECHO PARK AVE |
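One caveat: `str.rstrip('# 0123456789')` removes trailing characters from that set one by one, so a street whose name itself ends in digits (for example 'AVENUE 26') would also lose them. A regex-based alternative, as a sketch under the assumption that unit numbers always follow a '#':

```python
import pandas as pd

# sample addresses after the street number has been split off (illustrative)
streets = pd.Series(['WORLD WAY # 120', 'HOLLYWOOD BLVD # 253', 'AVENUE 26'])
# drop only an explicit trailing '# <number>' unit marker,
# so a street name that itself ends in digits survives intact
cleaned = streets.str.replace(r'\s*#\s*\d+$', '', regex=True)
```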
4.7 Plot a graph of the top ten streets by number of restaurants.
# count the number of locations per street
street_counts = cafe_data.pivot_table(index = 'street', values = 'location_id', aggfunc = 'count').reset_index()
street_counts.columns = ['street', 'num_per_street']
street_counts.head()
| | street | num_per_street |
|---|---|---|
| 0 | 3RD ST | 1 |
| 1 | 7TH ST | 1 |
| 2 | BEVERLY BLVD | 3 |
| 3 | CITY TERRACE DR | 1 |
| 4 | COLORADO BLVD | 1 |
# create a table of the top 10 streets with the most locations
top_ten = street_counts.sort_values('num_per_street', ascending = False).reset_index()
top_ten_table = top_ten.head(10)
top_ten_table
| | index | street | num_per_street |
|---|---|---|---|
| 0 | 1635 | W SUNSET BLVD | 305 |
| 1 | 1601 | W PICO BLVD | 305 |
| 2 | 1733 | WILSHIRE BLVD | 223 |
| 3 | 427 | HOLLYWOOD BLVD | 194 |
| 4 | 1236 | SANTA MONICA BLVD | 191 |
| 5 | 1185 | S WESTERN AVE | 183 |
| 6 | 913 | S FIGUEROA ST | 173 |
| 7 | 1378 | W 3RD ST | 173 |
| 8 | 1159 | S VERMONT AVE | 165 |
| 9 | 1581 | W OLYMPIC BLVD | 155 |
# plot the data
plt.figure(figsize = (12, 6))
top_ten_graph = sns.barplot(data = top_ten_table, x = 'street', y = 'num_per_street')
plt.title('Top 10: Streets With The Most Establishments', size = 15)
plt.xlabel('street', size = 15)
# alternatively, pass order = ... to sns.barplot and replace this line with plt.xticks(rotation = 30)
top_ten_graph.set_xticklabels(labels = top_ten_table['street'], rotation = 30)
plt.ylabel('number of locations per street', size = 15)
for bar in top_ten_graph.patches:
top_ten_graph.annotate(format(bar.get_height(), '.0f'),
xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
ha = 'center',
va = 'center',
xytext = (0, 7),
textcoords = 'offset points')
plt.show()
4.7 CONCLUSION: Plot a graph of the top ten streets by number of restaurants.
4.8 Find the number of streets that only have one restaurant.
# calculate the number of streets that only have one restaurant
one_per_street = len(street_counts.query('num_per_street == 1'))
print(one_per_street, 'streets have only one restaurant')
1213 streets have only one restaurant
4.8 CONCLUSION: Find the number of streets that only have one restaurant.
4.9 For streets with a lot of restaurants, look at the distribution of the number of seats. What trends can you see?
# review data
cafe_data.head(2)
| | location_id | location_name | address | chain | location_type | num_seats | add_num | street |
|---|---|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 | 3708 | N EAGLE ROCK BLVD |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 | 100 | WORLD WAY |
# make a list of top 10 streets
top_ten_list = list(top_ten_table['street'])
top_ten_list
['W SUNSET BLVD', 'W PICO BLVD', 'WILSHIRE BLVD', 'HOLLYWOOD BLVD', 'SANTA MONICA BLVD', 'S WESTERN AVE', 'S FIGUEROA ST', 'W 3RD ST', 'S VERMONT AVE', 'W OLYMPIC BLVD']
# obtain the data on the top 10 streets
top_ten_data = cafe_data.query('street in @top_ten_list')
top_ten_data.head()
| | location_id | location_name | address | chain | location_type | num_seats | add_num | street |
|---|---|---|---|---|---|---|---|---|
| 2 | 11788 | STREET CHURROS | 6801 HOLLYWOOD BLVD # 253 | False | Fast Food | 20 | 6801 | HOLLYWOOD BLVD |
| 3 | 11789 | TRINITI ECHO PARK | 1814 W SUNSET BLVD | False | Restaurant | 22 | 1814 | W SUNSET BLVD |
| 10 | 11796 | EL POLLO LOCO | 5319 W SUNSET BLVD | True | Restaurant | 38 | 5319 | W SUNSET BLVD |
| 16 | 11802 | HMS BOUNTY | 3357 WILSHIRE BLVD | False | Restaurant | 147 | 3357 | WILSHIRE BLVD |
| 22 | 11808 | CAMY'S GRILL | 3339 WILSHIRE BLVD | False | Restaurant | 39 | 3339 | WILSHIRE BLVD |
# calculate the number of establishments per street
top_ten_counts = top_ten_data.pivot_table(index = 'street', values = 'location_id', aggfunc = 'count').reset_index()
top_ten_counts.columns = ['street', 'num_rest_per_street']
top_ten_counts
| | street | num_rest_per_street |
|---|---|---|
| 0 | HOLLYWOOD BLVD | 194 |
| 1 | S FIGUEROA ST | 173 |
| 2 | S VERMONT AVE | 165 |
| 3 | S WESTERN AVE | 183 |
| 4 | SANTA MONICA BLVD | 191 |
| 5 | W 3RD ST | 173 |
| 6 | W OLYMPIC BLVD | 155 |
| 7 | W PICO BLVD | 305 |
| 8 | W SUNSET BLVD | 305 |
| 9 | WILSHIRE BLVD | 223 |
# calculate the number of seats per street
top_ten_seats = top_ten_data.pivot_table(index = 'street', values = 'num_seats', aggfunc = ['sum', 'mean', 'median']).reset_index()
top_ten_seats.columns = ['street', 'total_seats_per_street', 'avg_seats_per_street', 'median_seats_per_street']
top_ten_seats
| | street | total_seats_per_street | avg_seats_per_street | median_seats_per_street |
|---|---|---|---|---|
| 0 | HOLLYWOOD BLVD | 10808 | 55.711340 | 36 |
| 1 | S FIGUEROA ST | 8276 | 47.838150 | 26 |
| 2 | S VERMONT AVE | 7480 | 45.333333 | 28 |
| 3 | S WESTERN AVE | 7745 | 42.322404 | 28 |
| 4 | SANTA MONICA BLVD | 6840 | 35.811518 | 22 |
| 5 | W 3RD ST | 7062 | 40.820809 | 28 |
| 6 | W OLYMPIC BLVD | 8199 | 52.896774 | 28 |
| 7 | W PICO BLVD | 12342 | 40.465574 | 26 |
| 8 | W SUNSET BLVD | 15494 | 50.800000 | 35 |
| 9 | WILSHIRE BLVD | 12515 | 56.121076 | 37 |
# create a table of seats and establishments per street
top_ten_table = pd.merge(top_ten_counts, top_ten_seats, on = 'street')
top_ten_table
| | street | num_rest_per_street | total_seats_per_street | avg_seats_per_street | median_seats_per_street |
|---|---|---|---|---|---|
| 0 | HOLLYWOOD BLVD | 194 | 10808 | 55.711340 | 36 |
| 1 | S FIGUEROA ST | 173 | 8276 | 47.838150 | 26 |
| 2 | S VERMONT AVE | 165 | 7480 | 45.333333 | 28 |
| 3 | S WESTERN AVE | 183 | 7745 | 42.322404 | 28 |
| 4 | SANTA MONICA BLVD | 191 | 6840 | 35.811518 | 22 |
| 5 | W 3RD ST | 173 | 7062 | 40.820809 | 28 |
| 6 | W OLYMPIC BLVD | 155 | 8199 | 52.896774 | 28 |
| 7 | W PICO BLVD | 305 | 12342 | 40.465574 | 26 |
| 8 | W SUNSET BLVD | 305 | 15494 | 50.800000 | 35 |
| 9 | WILSHIRE BLVD | 223 | 12515 | 56.121076 | 37 |
# plot the data
print('Correlation Between Number of Seats and Number of Establishments per Street')
# plt.figure(figsize = (12, 6))
top_ten_graph = sns.jointplot(data = top_ten_table, x = 'total_seats_per_street', y = 'num_rest_per_street', kind = 'reg')
# plt.title('Correlation Between Number of Seats and Number of Establishments per Street')
plt.xlabel('seats per street', size = 15)
plt.ylabel('establishments per street', size = 15)
top_ten_graph
plt.show()
Correlation Between Number of Seats and Number of Establishments per Street
# distribution of the number of seats on top 10 streets
plt.figure(figsize = (12, 6))
sns.histplot(top_ten_data['num_seats'], kde = True, stat = 'density')  # distplot is deprecated in recent seaborn
plt.title('Seat Distribution on Top 10 Streets', size = 15)
plt.xlabel('number of seats', size = 15)
plt.ylabel('density', size = 15)
plt.show()
# seat distribution on top 10 streets, 50 seats or fewer
plt.figure(figsize = (12, 6))
under_fifty_top_ten = top_ten_data.query('num_seats <= 50')
sns.histplot(under_fifty_top_ten['num_seats'], kde = True, stat = 'density')  # distplot is deprecated in recent seaborn
plt.title('Seat Distribution on Top 10 Streets With 50 Seats or Fewer')
plt.xlabel('number of seats', size = 15)
plt.ylabel('density', size = 15)
plt.show()
4.9 CONCLUSION: For streets with a lot of restaurants, look at the distribution of the number of seats. What trends can you see?
# review the data
top_ten_data.head(2)
| | location_id | location_name | address | chain | location_type | num_seats | add_num | street |
|---|---|---|---|---|---|---|---|---|
| 2 | 11788 | STREET CHURROS | 6801 HOLLYWOOD BLVD # 253 | False | Fast Food | 20 | 6801 | HOLLYWOOD BLVD |
| 3 | 11789 | TRINITI ECHO PARK | 1814 W SUNSET BLVD | False | Restaurant | 22 | 1814 | W SUNSET BLVD |
top_ten_data['num_seats'].describe()
count    2067.000000
mean       46.812288
std        48.987439
min         1.000000
25%        14.500000
50%        29.000000
75%        48.000000
max       229.000000
Name: num_seats, dtype: float64
# review data
top_ten_data.head(2)
| | location_id | location_name | address | chain | location_type | num_seats | add_num | street |
|---|---|---|---|---|---|---|---|---|
| 2 | 11788 | STREET CHURROS | 6801 HOLLYWOOD BLVD # 253 | False | Fast Food | 20 | 6801 | HOLLYWOOD BLVD |
| 3 | 11789 | TRINITI ECHO PARK | 1814 W SUNSET BLVD | False | Restaurant | 22 | 1814 | W SUNSET BLVD |
# create a table with the number of establishment types in the top 10 streets
top_ten_type_counts = top_ten_data.groupby('location_type').agg({'location_id': 'count'}).reset_index()
top_ten_type_counts.columns = ['location_type', 'num_per_type']
top_ten_type_counts
| | location_type | num_per_type |
|---|---|---|
| 0 | Bakery | 67 |
| 1 | Bar | 59 |
| 2 | Cafe | 82 |
| 3 | Fast Food | 177 |
| 4 | Pizza | 81 |
| 5 | Restaurant | 1601 |
# plot the data
# create a variable to hold the location information
top_ten_rest_labels = top_ten_type_counts['location_type']
# create a variable to hold the value information
top_ten_type_count = top_ten_type_counts['num_per_type']
top_ten_type_pie = go.Figure(data = [go.Pie(labels = top_ten_rest_labels, values = top_ten_type_count)])
top_ten_type_pie.update_layout(title = 'Top 10 Proportion of Establishment Types')
top_ten_type_pie.show()
Conclusion
top_ten_chain_type = top_ten_data.groupby('chain').agg({'location_id': 'count'}).reset_index()
top_ten_chain_type.columns = ['chain', 'count']
top_ten_chain_type.head(2)
| | chain | count |
|---|---|---|
| 0 | False | 1331 |
| 1 | True | 736 |
# create a variable to hold the labels for the graph
top_ten_chain_labels = ['Non-Chain', 'Chain']
# create a variable to hold the values for the graph
top_ten_chain_type_count = top_ten_chain_type['count']
# graph the data
top_ten_chain_pie = go.Figure(data = [go.Pie(labels = top_ten_chain_labels, values = top_ten_chain_type_count)])
top_ten_chain_pie.update_layout(title = 'Chain vs Non-Chain Establishments in Top 10 Streets')
top_ten_chain_pie.show()
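The chain/non-chain split can be sanity-checked without the groupby, using `value_counts(normalize=True)`. A sketch on a stand-in series built from the counts reported above (`top_ten_data` isn't rebuilt in this cell):

```python
import pandas as pd

# stand-in with the counts from the table above: 1331 non-chain, 736 chain
chain_flags = pd.Series([False] * 1331 + [True] * 736, name='chain')

# normalize=True returns shares instead of raw counts
shares = chain_flags.value_counts(normalize=True)
print(shares.round(3))
# chains are roughly a third of venues on the top 10 streets
```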
Conclusion
# query for the cafe location type
top_ten_cafes_query = top_ten_data.query('location_type == "Cafe"')
top_ten_cafes_query
| | location_id | location_name | address | chain | location_type | num_seats | add_num | street |
|---|---|---|---|---|---|---|---|---|
| 155 | 11941 | MILLIE'S COFFEE SHOP | 3524 W SUNSET BLVD | False | Cafe | 22 | 3524 | W SUNSET BLVD |
| 156 | 11942 | ANDANTE COFFEE ROASTERS | 2201 W SUNSET BLVD | True | Cafe | 4 | 2201 | W SUNSET BLVD |
| 495 | 12281 | STARBUCKS COFFEE #6688 | 1601 WILSHIRE BLVD | True | Cafe | 37 | 1601 | WILSHIRE BLVD |
| 1149 | 12935 | STARBUCKS COFFEE #546 | 10911 W PICO BLVD | True | Cafe | 42 | 10911 | W PICO BLVD |
| 1198 | 12984 | CRAFT COFFEE AND GOOD FOOD CAFE | 5636 HOLLYWOOD BLVD | True | Cafe | 25 | 5636 | HOLLYWOOD BLVD |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9281 | 21067 | STARBUCKS COFFEE #5488 | 5020 WILSHIRE BLVD | True | Cafe | 27 | 5020 | WILSHIRE BLVD |
| 9285 | 21071 | STARBUCKS-17413 | 3584 S FIGUEROA ST # #3 | True | Cafe | 31 | 3584 | S FIGUEROA ST |
| 9449 | 21235 | CLOVE COFFEE | 5007 W PICO BLVD | False | Cafe | 8 | 5007 | W PICO BLVD |
| 9598 | 21384 | GOOD PEOPLE COFFEE CO | 11609 SANTA MONICA BLVD | False | Cafe | 2 | 11609 | SANTA MONICA BLVD |
| 9635 | 21421 | THE TEA & COFFEE EXCHANGE | 6801 HOLLYWOOD BLVD # 120 | False | Cafe | 4 | 6801 | HOLLYWOOD BLVD |
82 rows × 8 columns
top_ten_cafes_query['num_seats'].describe()
count    82.000000
mean     22.097561
std      14.347202
min       2.000000
25%       8.500000
50%      22.000000
75%      33.250000
max      49.000000
Name: num_seats, dtype: float64
# create a table to count the number of cafes per street
top_ten_cafes = top_ten_cafes_query.groupby('street').agg({'location_id': 'count'}).reset_index()
top_ten_cafes.columns = ['street', 'num_cafes']
top_ten_cafes.sort_values(by = 'num_cafes', ascending = False)
| | street | num_cafes |
|---|---|---|
| 8 | W SUNSET BLVD | 21 |
| 7 | W PICO BLVD | 11 |
| 9 | WILSHIRE BLVD | 11 |
| 4 | SANTA MONICA BLVD | 10 |
| 0 | HOLLYWOOD BLVD | 8 |
| 1 | S FIGUEROA ST | 6 |
| 5 | W 3RD ST | 5 |
| 3 | S WESTERN AVE | 4 |
| 6 | W OLYMPIC BLVD | 4 |
| 2 | S VERMONT AVE | 2 |
# plot the data
plt.figure(figsize = (12, 6))
top_ten_cafes_graph = sns.barplot(data = top_ten_cafes, x = 'street', y = 'num_cafes',
order = top_ten_cafes.sort_values(by = 'num_cafes', ascending = False).street)
plt.title('Number of Cafes on Top 10 Streets', size = 15)
plt.xlabel('street', size = 15)
plt.ylabel('number of cafes per street', size = 15)
# this line was overriding the order parameter in sns.barplot(... order = ...)
# top_ten_cafes_graph.set_xticklabels(labels = top_ten_cafes['street'], rotation = 30)
# this allows the xticks to be rotated w/o messing with the labels
plt.xticks(rotation = 30)
for bar in top_ten_cafes_graph.patches:
top_ten_cafes_graph.annotate(format(bar.get_height(), '.0f'),
xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
ha = 'center',
va = 'center',
xytext = (0, 7),
textcoords = 'offset points')
plt.show()
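On matplotlib 3.4+ the annotation loop above can be collapsed into a single `Axes.bar_label` call. A minimal sketch, using hypothetical data with the first three street counts from the table above:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

# first three rows of the cafes-per-street table above
streets = ['W SUNSET BLVD', 'W PICO BLVD', 'WILSHIRE BLVD']
num_cafes = [21, 11, 11]

fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.bar(streets, num_cafes)
# one call labels every bar; padding replaces the manual xytext offset
ax.bar_label(bars, fmt='%.0f', padding=7)
ax.set_title('Number of Cafes on Top 10 Streets (first three shown)')
plt.xticks(rotation=30)
fig.savefig('cafes_bar.png', dpi=150)
```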
Conclusion
The team is investigating opening a cafe with robot servers, with an interest in growing it into a chain.
Establishment types, chain vs. non-chain status, and the number of seats were evaluated.
General Conclusions
Overall
Recommendations
Presentation: https://drive.google.com/file/d/1pDOHgxA-n9iJ1utBtT-p7JtMx6lBUhUT/view?usp=sharing